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We consider Bayesian estimation of apxp precision matrix, when 
p can be much larger than the available sample size n. It is well known 
that consistent estimation in such ultra-high dimensional situations 
requires regularization such as banding, tapering or thresholding. We 
consider a banding structure in the model and induce a prior distri- 
bution on a banded precision matrix through a Gaussian graphical 
model, where an edge is present only when two vertices are within 
a given distance. We show that under a very mild growth condition 
and a proper choice of the order of graph, the posterior distribution 
based on the graphical model is consistent in the Loo-operator norm 
uniformly over a class of precision matrices, even if the true precision 
matrix may not have a banded structure. Along the way to the proof, 
we also establish that the maximum likelihood estimator (MLE) is 
also consistent under the same set of condition, which is of indepen- 
dent interest. We also conduct a simulation study to compare finite 
sample performance of the Bayes estimator and the MLE based on 
the graphical model with that obtained by using a banding opera- 
tion on the sample covariance matrix. We observe that the graphical 
model based estimators perform significantly better, especially if the 
banded sample covariance matrix is not positive definite. In contrast, 
the graphical model based estimators are always positive definite. 
Finally, we discuss a practical method of choosing the order of the 
graphical model using the marginal likelihood function. 

1. Introduction. Estimating a covariance matrix or a precision ma- 
trix (inverse covariance matrix) is one of the most important problems in 
multivariate analysis. Of special interest are situations where the number 
of underlying variables p is much larger than the sample size n. Such kind 
of situations are common in gene expression data, fMRI data and in sev- 
eral other modern applications. Special care needs to be taken for tackling 
such high-dimensional scenarios. Conventional estimators like the sample co- 
variance matrix or maximum likelihood estimator behave poorly when the 
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dimensionality is much higher than the sample size. 

Different regularization based methods have been proposed and developed 
in the recent years for dealing with high-dimensional data. These include 
banding, thresholding, tapering and penalization based methods to name 
a few; see, for example, [24, 19, 38, 3, 4, 20, 15, 32, 22, 31, 8, 6]. Most 
of these regularization based methods for high dimensions impose a sparse 
structure in the covariance or the precision matrix, as in [3] , where a rate of 
convergence has been derived for the estimator obtained by "banding" the 
sample covariance matrix, or by banding the Cholesky factor of the inverse 
sample covariance matrix, as long as n~ 1 logp — > 0. Cai, Zhang and Zhou 
[8] obtained the minimax rate under the operator norm and constructed a 
tapering estimator which attains the minimax rate over a smoothness class 
of covariance matrices. Cai and Liu [5] proposed an adaptive thresholding 
procedure. More recently, Cai and Yuan [7] introduced a data-driven block- 
thresholding estimator which is shown to be optimally rate adaptive over 
some smoothness class of covariance matrices. 

There are only a few relevant work in Bayesian inference for such kind 
of problems. Ghosal [16] studied asymptotic normality of posterior distribu- 
tions for exponential families when the dimension p — > oo, but restricting 
to p <C n. Recently, Pati et al. [29] considered sparse Bayesian factor mod- 
els for dimensionality reduction in high dimensional problems and showed 
consistency in the L2-operator norm (also known as the spectral norm) by 
using a point mass mixture prior on the factor loadings, assuming such a 
factor model representation of the true covariance matrix. 

Graphical models [23] serve as an excellent tool in sparse covariance or 
inverse covariance estimation; see [14, 27, 38, 15], as they capture the con- 
ditional dependency between the variables by means of a graph. Bayesian 
methods for inference using graphical models have also been developed, as 
in [34, 1, 26]. For a complete graph corresponding to the saturated model, 
clearly the Wishart distribution is the conjugate prior for the precision ma- 
trix fi; see [12]. For an incomplete decomposable graph, a conjugate family 
of priors is given by the G- Wishart prior [Roverato [34]]. The equivalent 
prior on the covariance matrix is termed as the hyper inverse Wishart dis- 
tribution in Dawid and Lauritzen [11]. Letac and Massam [26] introduced a 
more general family of conjugate priors for the precision matrix, known as 
the Wp G -Wishart family of distributions, which also has the conjugacy prop- 
erty. The properties of this family of distribution were further explored in 
[30]. Rajaratnam et al. [30] also obtained expressions for Bayes estimators. 

In this paper, we consider Bayesian estimation of the precision matrix 
working with a G- Wishart prior induced by a Gaussian graphical model, 
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which has a Markov property with respect to a decomposable graph G. For 
estimators arising from the resulting conjugacy structure, we establish their 
consistency and derive their posterior convergence rates. More specifically, 
we work with a Gaussian graphical model structure which induces banding 
in the corresponding precision matrix. Using this graphical model ensures 
the decomposability of the graph, along with the presence of a perfect set of 
cliques, as explained in Section 2. For a G-Wishart prior, we can compute the 
explicit expression of the normalizing constant of the corresponding marginal 
distribution of the graph (see Section 7). For arbitrary decomposable graphs, 
the computation of the normalizing constant requires Markov chain Monte- 
Carlo (MCMC) based methods; see [1, 9, 10, 25, 13]. 

The paper is organized as follows. In the next section, we discuss some 
preliminaries on graphical models. In Section 3, we formulate the estimation 
problem and the describe the corresponding model assumptions. Section 4 
deals with the main results related to posterior consistency and convergence 
rates. We extend the results proved in Section 4 to estimators using a ref- 
erence prior on the covariance parameter in the next section. A method for 
selecting the bandwidth parameter using the explicit form of the marginal 
likelihood of a graph inducing a banding structure with a given banding pa- 
rameter is discussed in Section 6. In Section 7, we compare the performance 
of the Bayesian estimator with the graphical maximum likelihood estimator 
(MLE) and the banding estimator as proposed by [4]. Proof of the main 
results are presented in Section 8. Some auxiliary lemmas and their proofs 
are included in the Appendix. 

2. Notations and preliminaries on graphical models. We first 
describe the notations to be used in this paper. By t n = 0(5 n ) (respectively, 
o(5 n )), we mean that t n /5 n is bounded (respectively, t n /5 n — > as n — > oo). 
For a random sequence X n , X n = Op(5 n ) (respectively, X n = op{5 n )) means 
that P(|X n | < M5 n ) — > 1 for some constant M (respectively, P(|A" n | < 
e5 n ) — > 1 for all e > 0). For numerical sequences r n and s n , by r n <C s n 
(or, r n ^> s n ) we mean that r n = o(s n ), while by s n > r n we mean that 
r n = 0(s n ). By r n x s n , we mean that r n = 0(s n ) and s n = 0(r n ), while 
r n ~ s n stands for r n /s n — > 1. The indicator function is denoted by H. 

We represent vectors in bold lowercase English or Greek letters. The com- 
ponents of a vector are represented by the corresponding non-bold letters, 
that is, for x G BP, ) . We define the following norms for 

/ \ l/r 

a vector x E W: \\x\\ r = (Xw=i \ x j\ r ) > \\ x \\oo = maxj \xj\. Matrices are 
denoted in bold uppercase English or Greek letters, like A = ((ajj)), where 
dij stands for the (i,j)ih entry of A. If A is a symmetric p x p matrix, 
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let eigi(A), . . . , eig p (A) stand for its eigenvalues. We consider the following 
norms on p x p matrices 

ll^llr = {Yh=i l a ij| r ) 1/r > 1 < r < oo, Halloo = maxjj |ay|, 
ll^ll^s) = sup{||Aa:|| s : [|sc[| r = 1}, 

by respectively viewing A as a vector in M p2 and an operator from (MP, || • || r ) 
to (MP, || • || s ), where 1 < r, s < oo. This gives 

11-4-11(1,1) = max i Ei l a d> ll^ll(oo l0 o) = maxiY^j\aij\ 
11-411(2,2) = {max(eig i (A T A): 1 < i < p)} 1/2 , 

and that for symmetric matrices, ||A||( 2 ,2) = max {| e igj(A)|: 1 < i < p}, 
and = || AH^^). The norm || • ||( nr ) will be referred to as the L r - 

operator norm. For two matrices A and B, we say that A > B (respectively, 
A > B) if A — B is nonnegative definite (respectively, positive definite). 
Thus A > for a positive definite matrix A, where stands for the zero 
matrix. The identity matrix of order p will be denoted by I p . A vector of l's 
is denoted by 1. 

Sets are denoted in non-bold uppercase English letters. For a set T, we 
denote the cardinality, that is, the number of elements in T, by #T. We 
denote the submatrix of the matrix A induced by the set T C {1,2, ... ,p} 
by At, i.e., At = (iflif i,j £ T)). By Ay , we mean the inverse (At) 1 of 
the submatrix At- For apxp matrix A = ((aij)), let (At) = ((a*j)) denote a 
p-dimensional matrix such that a*j = aij for (i, j) G T x T, and otherwise. 
Also we denote the "banded" version of A by Bj,(A) = ((ajjH{|i — j\ < k})) 
corresponding to banding parameter k, k < p. 

Now we discuss some preliminaries on graphical models. An undirected 
graph G consists of a non-empty vertex set V = {1,2, . . . ,p} along with an 
edge-set E C {(i, j) G V x V: i < j}. The vertices in V are the indices of 
the components of a p-dimensional random vector X = (X\, X%, . . . , X p ) T . 
The absence of an edge (i,j) corresponds to the conditional independence of 
Xi and Xj given the rest. For a Gaussian random variable X with precision 
matrix Q = ((wy)), this is equivalent to uJij = 0. Figure 1 illustrates the con- 
nection between a banded precision matrix and the corresponding graphical 
model. Following the notation in [26], we restrict the canonical parameter fi 
in Vgi where Vq is the cone of positive definite symmetric matrices of order 
p having zero entry corresponding to each missing edge in E. Denoting the 
linear space of symmetric matrices of order p by M, let M p C M be the 
cone of positive definite matrices. The linear space of symmetric incomplete 
matrices A = ((oy)) with missing entries a^, (i,j) £ E, will be denoted by 



CONVERGENCE RATES FOR ESTIMATING LARGE PRECISION MATRICES 5 



Xq. The parameter space of the Gaussian graphical model can be described 
by the set of incomplete matrices X! = /-c(fi _1 ), f2 6 Vg\ here, k: M — > Xq 
is the projection of M into Xq] see [26]. 
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Fig 1 . [Left] Structure of a banded precision matrix with shaded non-zero entries. [Right] 
The graphical model corresponding to a banded precision matrix of dimension 6 and banding 
parameter 3. 

A subgraph G' of G consists of a subset V of V and E' = G 
E: i,j E V'}. A maximal saturated subgraph of G is called a clique. A path 
in a graph is a collection of adjacent edges. A subset S of E is called a 
separator of two cliques C\ and C2, if all intermediate edges in every path 
from C\ to Ci must entirely lie in S. A graph is called decomposable if it 
is possible to find a set of cliques covering all vertices connected by a set of 
separators. We shall only deal with decomposable graphs in the paper. For 
detailed concepts and notations for graphical models, we refer the readers 
to [23]. A set of cliques C = {C\, C2, ■ ■ ■ , C r } are said to be in perfect order, 
if the following holds: For 

Hi = Ri = Ci , Ha = Cj U . . . U Cj , 

(2.1) ; 

= CjX-^'-i) Sj = Hj-i ^ Cj, j = 2, . . . ,r, 

S = {Sj, j = 2, . . . ,p} is the set of minimal separators in G. For a decompos- 
able graph, a perfect order of the cliques always exists. For a decomposable 
graph G with a perfect order of the cliques {Ci, C2, . . . , C r } and the pre- 
cision matrix ft is given to lie in Vg, the incomplete matrix S is defined 
in terms of the submatrices corresponding to the cliques, that is, for each 
i = 1,2, ... ,r, Sc. is positive definite. Thus we have the parameter space 
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for the decomposable Gaussian graphical models restricted to the two cones 

(2.2) V G = {A = W6M+: ay =0,(iJ)$E}, 

(2.3) Q G = {Be X G : B Ci > 0,t = l,2,. ..,r}, 

respectively for fi and S. 

The Wp G -Wishart distribution Wp G (ct, j3, D) has three set of parameters 
a, (3 and D, where a and (3 are suitable functions defined on the cliques 
and separators of the graph respectively, and D is a scaling matrix. The G- 
Wishart distribution Wg(S, D) is a special case of the Wp G -Wishart family 
where 

^ ^ ai = 2~ ' 1 = 1 ' 2 '"-' r ' 

Pi = ^~ ) i = 2, 3, . . . , r. 

3. Model assumption and prior specification. Let X\, X2, . . . , X n 

be independent and identically distributed (i.i.d.) random p-vectors with 
mean and covariance matrix Write Xi = (Xn, Xi2, ■ . . , Xi p ) T , and 
assume that the Xi, i = l,...,n, are multivariate Gaussian. Consistent 
estimators for the covariance matrix were obtained in [4] by banding the 
sample covariance matrix, assuming a certain sparsity structure on the true 
covariance. Our aim is to obtain consistency of the graphical MLE and 
and Bayes estimates of the precision matrix fi = under the condi- 

tion n _1 logp — > where ft ranges over some fairly natural families. For a 
given positive sequence j(k) 1 0, we consider the class of positive definite 
symmetric matrices ft = ((uy)) as 

U(eo, = i^fl: max ^^{tcfy-: \i — j\ > k} < j(k) for all k > 0, 

< £0 < mineigj(ri) < maxeigj(f2) < Sq 1 < 00 1 . 

We also define another class of positive definite symmetric matrices as 

V(K,i) = { n:maxV'{|w«|: \i-j\ > k} < 7 (A;)forallfc > 0, 
{ 3 i 

maxdiri^ 1 !!^^), ||r2|| {o0i0o) } < if j . 

These two classes are closely related, as shown by the following lemma. 



(3.1 
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Lemma 3.1. For every eq, there exist K\ < K2 such that 

(3.3) V(Xi,7)cW( eo ,7)cV(if 2 ,7). 

The sequence 'y(k) which bounds -E>fc(^)|| (00,00) nas been kept flexible 
so as to include a number of matrix classes. 

1. Exact banding: ^y(k) = for all k > ko, which means that the true 
precision matrix is banded, with banding parameter ko. For instance, 
any autoregressive process has such a form of precision matrix. 

2. Exponential decay: j(k) = e~ ck . For instance, any moving average 
process has such a form of precision matrix. 

3. Polynomial decay: 7(fc) = r yk~ a 1 a > 0. This class of matrices has 
been considered in Bickel and Levina [4]. 

We shall work with these two general classes U{eQ,"f) and V(K, 7) for esti- 
mating ft. A banding structure in the precision matrix can be induced by a 
Gaussian graphical model model. Since = implies that the components 
Xi and Xj of X are conditionally independent given the others, we can 
thus define a Gaussian graphical model G = (V,E), where V = {1, . . . ,p} 
indexing the p components X\, X2, ■ ■ ■ , X p , and E is the corresponding edge 
set defined by E = {(i, j): \i — j\ < k}, where k is the size of the band. This 
describes a parameter space for precision matrices consisting of /c-banded 
matrices, and can be used for the maximum likelihood or the Bayesian ap- 
proach, where for the latter, a prior distribution on these matrices must 
be specified. Clearly, G is an undirected, decomposable graphical model for 
which a perfect order of cliques exist, given by C = {C\, C2, ■ ■ ■ , C p _fc}, 
= {i'i + 1) ■ • • >i + &}> 3 = 1) 2, . . . ,p — k. The corresponding separa- 
tors are given by S = {S 2 , S3, ... , S p - k }, Sj = + 1, . . . ,j + k - 1}, 
j = 2, 3, . . . ,p — k. The choice of the perfect set of cliques is not unique, but 
the estimator for the precision matrix ft under all choices of the order re- 
mains the same. The Wp G -family, as a prior distribution for 5], is conjugate 
- if the prior distribution on ^ft is Wp G (a., f3, D), then the posterior dis- 
tribution of Tjfl given the sample covariance S = n _1 -X-iX? is given 
by Wp G (a — |l,/3 — |1, D + K(nS)). As mentioned earlier, the G-Wishart 
Wg(S, D) is a special case of Wp G (a, (3, D) for suitable choice of functions 
a and f3. In our case, #Cj = k + 1 for all i = 1, 2, . . . , p — k, and #Sj = k 
for all j = 2, 3, . . . ,p — k. Thus 

±k 

— , i — 1, 2, . . . ,p — k, 
+ k-l 

— , j = 2,3, ...,p-k. 



(3.4) 



a; 



Pi 
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The posterior mean of ft, given S is 



E(0|5) 



(3.5) 



p—k 

£( 

j'=i 



a,- 



;D + K (n5)) c ;)° 



p—k 



Y^(P 3 -l)((D + n{nS))- s ]) 



i=2 



Taking £} = Jp, the p dimensional indicator matrix, and plugging in the 
values of a and (3, the above estimator reduces to the Bayes estimator fi B 
with respect to the G-Wishart prior Wg(o~, I p ): 



B _ 5 + k + n 



n 



(3.6) 



p—k 



^((n-^ + ^r 1 ) 

J=l 

p ^((n^i k+ s Sj n o 

j=2 



1\0 



+ n~ 1 Y J {(n- 1 Ik + S Sj r 1 ) 

3=2 



The graphical MLE for Q under the graphical model with banding parameter 
k is given by 



(3.7) 



p—k p—k 
3=1 i=2 



4. Main results. In this section, we determine the convergence rate of 
the Bayes estimator of the precision matrix. An important step towards this 
goal is to find the convergence rate of the graphical MLE, which is also of 
independent interest. For high-dimensional situations, even when the sample 
covariance matrix is singular, the graphical MLE will be positive definite if 
the number of elements in the cliques of the corresponding graphical model 
is less than the sample size. Analogous results for banded empirical covari- 
ance (or precision) matrix or estimators based on thresholding approaches 
are typically given in terms of the L2-operator norm in the literature. We 
however use the stronger Loo-operator norm (or equivalently, Li-operator 
norm), so the implication of a convergence rate in our theorems is stronger. 

Theorem 4.1. Let X\, . . . , X n be random samples from a p- dimensional 
Gaussian distribution with mean zero and precision matrix fio £ ^( e o> 7) for 
some £q > and j(-). Then the graphical MLE f2 M of ft, corresponding to 
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the Gaussian graphical model with banding parameter k, has convergence 
rate given by 

(4.1) \\Q M - ft ||(oc,oc) = Op (max^n" 1 log p) 1/2 , 7 (*0}) • 

In particular, fi M is consistent in the Lao-operator norm if k — > oo such 
that k'^n^ 1 logp — > 0. 

The proof will use the explicit form of the graphical MLE and proceed 
by bounding the mean squared error in the Loo-operator norm. However, 
as the graphical MLE involves number of terms (k + l)(p — k/2) = 0(p), a 
naive approach will lead to a factor p in the estimate, which will not be able 
to establish consistency or a convergence rate in the truly high dimensional 
situations p S> n. We overcome this obstacle by looking more carefully at 
the structure of the graphical MLE, and note that for any row i, the number 
of terms in (3.7) which have non-zero ith row is only at most (2k + 1) <C p. 
This along with the description of Loo-operator norm in terms of row sums 
give rise to a much smaller factor than p. 

Now we treat Bayes estimators. Consider the G-Wishart prior Wg(S,I p ) 
for ft, where the graph G has banding of order k and 5 is a positive integer. 
The following result bounds the difference between f2 M and fl B . 

Lemma 4.2. Assume the conditions of Theorem 4-1 and suppose that fi 
is given the G-Wishart prior Wg(5,I p ), where the graph G has banding of 
order k. Then \\Q B - fi M || ( oo,oc) = P (k 2 /n). 

The proof of the above lemma is given in the Appendix. Theorem 4.1 and 
Lemma 4.2 together lead to the following result for the convergence rate of 
the Bayes estimator under the G in the Loo-operator norm. 

Theorem 4.3. In the setting of Lemma 4-2, the Bayes estimator satis- 
fies 

(4.2) - n ||(oo,oo) = Op (maxjfc 2 ^- 1 log p) 1 ' 2 ,7^)}) • 

In particular, the Bayes estimator fi B is consistent in the L^-operator norm 
if k — > oo such that k A n~ l logp — > 0. 

We now study the consistency and convergence rate of the posterior dis- 
tribution of the precision matrix given the data. The following theorem 
describes the behavior of the entire posterior distribution. 
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Theorem 4.4. In the setting of Lemma 1^.2, the 'posterior distribution 
of the precision matrix O satisfies 

(4.3) E [P {||0 - Oollfocoo) > M n e njk \X}] -> 1 

for e nt k = max{fc 2 (n _1 log p) 1 ^ 2 , 7(&)} and a sufficiently large constant 
M > 0. 

In particular, the posterior distribution is consistent in the L^- operator 
norm if k — > oo such that k^n^ 1 logp — > 0. 

Remarks on the convergence rates. Observe that the convergence rates 
of the graphical MLE, the Bayes estimator and the posterior distribution 
obtained above are the same. The obtained rates can be optimized by 
choosing k appropriately as in a bias-variance trade-off. The fastest pos- 
sible rates obtained from the theorems may be summarized for the different 
decay rates of "f(k) as follows: If the true precision matrix is banded with 
banding parameter ko, then the optimal rate of convergence n _1 / 2 (logp) 1//2 
is obtained by choosing any fixed k > ko. When 7(fc) decays exponen- 
tially, the rate of convergence n" 

-V2(logp)l/2(l og 

n) 2 can be obtained by 

choosing k n approximately proportional to log n with some sufficiently large 
constant of proportionality. If 7(/c) decays polynomially with index a as 
in [4], we get the consistency rate of (n _1 log p) a /( 2a + 4 ) corresponding to 
k n ^ (n- 1 logp)- 1 /^ 4 ). 

It is to be noted that we have not assumed that the true structure of the 
precision matrix arises from a graphical model. The graphical model is a con- 
venient tool to generate useful estimators through the maximum likelihood 
and Bayesian approach, but the graphical model itself may be a misspec- 
ified model. Further, it can be inspected from the proof of the theorems 
that the Gaussianity assumption on true distribution of the observations is 
not essential, although the graphical model assumes Gaussianity to generate 
estimators. The Gaussianity assumption is used to control certain probabil- 
ities by applying the probability inequality Lemma A. 3 of [4]. However, it 
was also observed by Bickel and Levina [4] that one only requires bounds 
on the moment generating function of Xf, i = 1, . . . ,p. In particular, any 
thinner tailed distribution, such as one with a bounded support, will allow 
the arguments to go through. 

5. Estimation using a reference prior. A reference prior for the 
covariance matrix X, obtained in [30], can also be used to induce a prior on 
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ft. This corresponds to an improper Wp G (a, (3, 0) distribution for ^fl with 

i = 1,2, . . . ,r, 



(5.1) 



a, = 0, 



(ci + c 2 ) - s 2 , /3j = -(c,- - Sj), 



i = 2,3,...,r. 



By Corollary 4.1 in [30], the posterior mean fi R of the precision matrix is 
given by 



R 



: £(S£)° - {1 - n- 1 ^ + c 2 - 2s 2 )}(Ss 2 1 ) 



(5.2) 



i=i 



^ { l_ n -i( Cj _ Sj)}( 5- 

j=3 



1\0 



Using this prior, we have an improvement in the Loo-operator norm of the 
difference between the Bayes estimator fi R and the graphical MLE fl M . 
However, this does not lead to any faster convergence rate of the Bayes 
estimator. 

Theorem 5.1. Under the reference prior mentioned above, 



(5.3) 



Mi 



I (00,00) 



Op (k*l 2 ln 



A sketch of the proof is given in Section 8. 

6. Bandwidth estimation. In this section, we propose a method of 
selecting the banding parameter k of the graphical model using the marginal 
posterior probabilities of the graph induced by banding k, A; = 1,2,.... For 
the G-Wishart prior Wg(S, D) for fi, the density is given by 



(6.1) 



p(n\G) = (I G (5,D))- 1 (det(fi)) 



(5-2)/2 



exp 



-tr(DO) 



where D is a symmetric positive definite matrix and 



(6.2) I G (S,D)= [ (det(ft)) (5 - 2)/2 exp 



tr(-DO) 



is the normalizing constant, which is finite for 5 > 2. The posterior is given 
by Wg(S + n,D + nS). Thus we can get the marginal likelihood for G as 



(6.3) 



p(X\G) = (2tt) 



-np/2 



I G (6 + n,D + nS) 
Ig(S,D) 
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For a complete graph G, Muirhead [28] showed that 

2 (S+p-l)p/2p I <H-p-A 

(6.4) I G (5, D) = J , 

where for a > (p - l)/2, F p (a) = vr^" 1 )/ 4 nf=o r ( a ~ D and r (0 denotes 
the gamma function. Roverato [34] showed that for a decomposable graph 
G 

(6,, ^-n-w*' 



where {Ci, . . . , C r } and {52, • • • , S r } denote the set of cliques and separators 
respectively corresponding to G. 

In our case, the model which is fit has a banded structure of the precision 
matrix, and corresponding to banding parameter k. We denote the graphical 
model induced by banding parameter k by G k . Let be a prior for the 
banding parameter k. Then the corresponding posterior probability is given 

by 

(6.6) P(MX) = ^f%ggg^ , 

where 

I G k (5 + n,D + nS) 



(6.7) J Gk (5,n,D,nS) 



I G k(S,D) 



Let the cliques and separators be respectively denoted by Cj = {j,j + 



1, ...,j+k}, j = l,...,p-k,&ndSf = {j,j+l,...,j+k-l},j = 2,,..,p-k. 
Note that the sub-graphs corresponding to the cliques and separators are 
complete, with respective dimensions k + 1 and k, and r = p — k. Therefore 
(6.4) and (6.5) together lead to 

_ ri^2 (m)(fc+1)/2 r fc+ i (g) (det(j^))(^)/ 2 
1 j GH ' j n P- 2 fc 2 (^-i)fe/2r fc (^i) (det(£> Cj )) (5+fc)/2 ' 

Now, with the choice D = I p used in the prior W G (5, 1), (6.7) gives 



J G k(S,n, I v ,nS) — 




(6.9) x 



nS(det((/ p + n5) 5fc ))(^- 1 )/ 2 



n^(det((I p + nS) cfc ))( 5 +™+ fc V 2 
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Substituting this expression in (6.6), we get an explicit expression for the 
posterior distribution of k. 

A natural method of selecting k is to consider the posterior mode. It can 
be located by detecting where the value of the ratio of 

p k J G k{S,n,I p ,nS) 

drops below 1. Using the expressions given above, it follows that the ratio 
is given by 



(fc+l -p (S+n+i\ 
L l o rffl 
( r 



2 




nf=2~ 1 (det((/ p + n5) sr ))^ 
X n?;f _1 (det((/ p + nS) ck+ i))V+ n + k + l V 2 

U%i(det((I p + nS) c ,))( s + k +^/ 2 
X (det((I p + nS) s ^+n+k-D/2 ■ 

Using properties of the gamma function, the expression above simplifies to 



Pk+l 



Pk \B(l,?±%±k) 



( 6 - 10 ) 



n^2 _1 (det((I P + n5) s Hi))( W+n »/ 2 



n?ir (det((/ p + n5) rfe+1 ))( 5 +«+ fe +D/ 2 

n^(det((J p + n5) c ,))(^)/ 2 _ 
U"Z2(det((I p + nS) sk ))i^ + k-i)/2 ] 



here .£?(•, •) denotes the beta function. The expression can be easily numeri- 
cally evaluated to obtain the posterior mode. 

Although the expression for the posterior probability of k is cumbersome 
to analyze theoretically, we observe that the posterior distribution will con- 
verge at the stated rates adaptively (i.e., without knowing the optimal order 
k n of k), if it can be shown that the posterior distribution of k concentrates 
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in the range [cik n , C2k n ] for some constants c\ < ci- Hierarchical prior distri- 
butions formed by a prior on a smoothing parameter often leads to adaptive 
posterior convergence rates as if the true smoothness is known, sometimes 
up to a logarithmic factor; see [2, 17, 18, 37, 33, 21, 36, 35]. Under some 
appropriate tail conditions on we expect this to hold. In that case, a very 
natural adaptive procedure for estimating large covariance matrices will be 
available through the Bayesian approach. 

7. Numerical results. We check the performance of the Bayes esti- 
mator of the precision matrix and compare with the graphical MLE and the 
banded estimator as proposed in [4]. 

We compare the Bayes estimator of the precision matrix and the corre- 
sponding estimator of the covariance matrix with the respective estimates 
given by the other two methods as mentioned above. Data is simulated from 
N(0, S), assuming specific structures of the covariance X! or the precision f2. 
For all simulations, we compute the L2- operator norm of the difference be- 
tween the estimate and the true parameter for sample sizes n = 50, 100, 200 
and p = n/2, ra, 2ra, 5n, representing cases like p < n, p ^ n, p > n and 
p>n, We simulate 100 replications in each cases. Some of the simulation 
models are the same as those in [4]. 

Example 7.1 (Autoregressive process: AR(1) covariance structure). Let 
the true covariance matrix have entries given by 

(7.1) <ri3=ft~ S \ l<i,3<P, 

with p = 0.3 in our simulation experiment. The precision matrix is banded 
in this case, with banding parameter 1. 

Example 7.2 (Autoregressive process: AR(4) covariance structure). The 
elements of true precision matrix are given by 

Uij =H\i - j\ = 0) + 0.4 Tl(\i - j\ = 1) + 0.2 ll(|i - j\ = 2) 
+ 0.2 - j\ = 3) + 0.1 llfli - j\ = 4). 

This is the precision matrix corresponding to an AR(4) process. 

Example 7.3 (Long range dependence). We consider a Fractional Gaus- 
sian Noise process, that is, the increment process of fractional Brownian 
motion. The elements of the true covariance matrix are given by 

(7.3) a l3 = + I?" ~ 2|i - j\ 2H + |I* - j\ - lH, 1 < i,j < P, 
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where H £ [0.5, 1] is the Hurst parameter. We take H = 0.7 in the simulation 
example. This precision matrix does not fall in the polynomial smoothness 
class used in the theorems. We include this example in the simulation study 
to check how the proposed method is performing when the assumptions of 
the theorems are not met. 

Table 1 shows the simulation results for the different scenarios and com- 
pares the performance of the Bayes estimator with the MLE and the banded 
estimator (denoted by BL) in terms of the L2-operator norm of the difference 
of the precision and covariance matrices from their respective true values. 
The maximum likelihood and Bayes estimates of the covariance matrix is 
obtained by inverting the estimated precision matrix, while for the band- 
ing approach the estimate of the precision matrix is obtained by inverting 
the estimated covariance matrix. The first column in the table with entries 
respectively Jli,f22>^3 stand for the situations where the data generating 
mechanism follow the processes respectively in Example 1, Example 2 and 
Example 3. The estimates for the first two examples have been computed 
using the value of the banding parameter of the true precision matrix. For 
Example 3, we used k = 1, the value which apparently gave the best result. 



8. Proofs. In this section we provide the proofs of the theorems and 
lemmas stated in Section 4. Proofs of these results will require some addi- 
tional lemmas and propositions, which we include in the Appendix. 

Proof of Theorem 4.1. The Loo-operator norm of the difference be- 
tween the graphical MLE f2 M and the true precision matrix Q,q can be 
written as 



(8.1) \\Q M - n ||(oo,oo) < ll^ M - ^fc(^o) || (00,00) + 11^0 - Sfc(n ) II (00,00) • 



As shown in [23], in a graphical model 



p—k 




p—k 



p—k 



p—k 



i=2 



i=2 
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Hence the first term can be written as 



p—k 



p—k 



p—k 



p—k 



£(0 -E(0 -£(^) 

j=l j=2 j=l j=2 



< 



p—k 

£ 



+ 



(00,00) 



p—k 



i=2 



(00,00) 




(00,00) 



Let us first bound the first term. Using the fact that there are only (2k + 1) 
terms in above expressions inside the norms which have a given row non-zero, 
it follows that 



p—k 



(00,00) 



max 

V 
p—k 



.7=1 k 



< max 



S^ 1 



< (2k + 1) max max [(^c 1 



(1,1') 
-1 



(l,l') 



(8.2) 



(2fc + 1) max 



Sc 1 ~ = 



(00,00) 



where the subscript (Z, Z') on the matrices above stand for their respective 
(Z,Z')th entries. Using the multiplicative inequality < ||A||[|iJ[| of 

operator norms, we have 



max 
3 



(00,00) 



(8.3) 



= max 
3 

< max 

3 



(00,00) 



J c, 



(00,00) 



*5.' 



(00,00) 



By assumption on the class of matrices and Lemma 3.1, 



V, 



(00,00) 



is 
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bounded by K%. From Lemma A. 3, 



P ( max S^ 1 



> Mi) < p maxP 

(00,00) / j 



(00,00) 



> Mi 



< M[pk 2 exp[—niink 2 ] 
for some constant Mi, M[,mi > 0, while from Lemma A. 2, 

u2„ r i.~2.2i 



P ( max || Sc. - ScJ (oooo) > t ) < M 2 pfc 2 exp[-m 2 nk' z t z 



for |i| < m' 2 for some constants M2,m2,m'2 > 0. 

We choose i = Akfa' 1 log p) 1 / 2 for some sufficiently large A to get the 
bound 



(8.4) 



7 = 1 V 7 



(00,00) 



By a similar argument, we can establish 



(8.5) 



p—k 

£ ( ( s i' 



j=2 



P (^(n^logp) 1 / 2 ) . 



P (V^logp) 1 / 2 ) . 



(00,00) 



Therefore, in view of the assumption ||fio ~~ -Sfe(^o) ll(oo,oo) < we obtain 
the result. □ 



PROOF of Lemma 4.2. The Loo-operator norm of fi B —fi M can be bounded 



by 
(8.6) 

(8.7) 



p—k 

i=2 



•1\0 



(oo,oo) 



5 + k + n 



n 



5 + k + n 



n 



i=i 

p—k 



3=1 
p—k 



(oo,oo) 



(8.9) 



+ 



5 + k + n 



v 



3=2 3=2 

p—k p—k 

E(W- 1 )°-E((^r 1 ) 



(00,00) 



(00,00) 
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Now, (8.6) above is 



— max 
n i ^ 



p—k 



2 {(n~ l I k + S 6 



3=2 
p—k 



(1,1') 



j=2 V 



(1,1') 



< 



2k + 1 



max max ^ | [{n 1 I k + S Sj ) x ] ^ 



n j 

— — max || (n h + S Sj ) || (00)00) , 



which is bounded by a multiple of 

k 3/2 



n 3 



max 1 1 (n 1 I k + S Sj 



-H 



1(2,2) 



(8.10) 



fc 3/2 

< max 

n i 

fc 3/2 

< 



Sc. 1 



(2,2) 



max 

n i 



(oo,oo) 



In view of Lemma A. 3, we have that for some M3, ML 772,3 > 0, 



P I max So. 



(00,00) 



,2„ ;„-2i 



>M 3 )< M^pk 2 exp[— m 3 nk 



which converges to zero if k 2 (logp)/n — > 0. This leads to the estimate 



(8.11) 



n 



-1 



^{(n^h + Ss,)- 1 )" 
i=2 



Op 



(00,00) 



For (8.7), we observe that 



i=i i=i 



(00,00) 



<(2A; + l)max (n" 1 !^! + 5 C ,) _1 - 



(00,00) 



< /c 3//2 max 
j 



(2,2) 
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and that 



(n 1 Jfc+i + So* 



,-i 



(2,2) 



' ' II,--) \\ n -^+l||( 2 ,2) 



(2,2) 



< n 



-i 



S^ 1 



< n- 1 



(2,2) 



(oo,oo) 



Now under k 2 (logp)/n — > 0, an application of Lemma A. 3 leads to the bound 
P (k 3 / 2 /n) for (8.7). 

A similar argument gives rise to the same Op(k 3 ^ 2 jn) bound for (8.8). 

Finally to consider (8.9). As argued in bounding (8.6), we have that 



p—k 

£((« 



c, 



- 1 )°-E((^r 1 ) 



J'=2 



(00,00) 



< (2k + 1) 



max 

3 



SB' 



+ max 

(00,00) j 



7-1 



(00,00) 



under the assumption /c 2 (logp)/n — )• by another application of Lemma 
A.3. Since n~ 1 (5 + k + n) - 1 = 0(k/n), it follows that (8.9) is P (k 2 /n), 
which is the weakest estimate among all terms in the bound for ||fi — fi ||. 
The result thus follows. □ 



Proof of Theorem 4.3. The proof directly follows from Theorem 4.1 
and Lemma 4.2 using the triangle inequality. □ 



Proof of Theorem 4.4. The posterior distribution of the precision ma- 
trix fi given the data X is a 67- Wishart distribution Wg(S + n,I p + nS), 
which means that the submatrix ttc for any clique Cj has a Wishart distri- 
bution with parameters 5+n and scale matrix (I p +nS)c j , j = 1, ■ ■ • , p—k. In 
particular, if i G Cj, then Ua has chi-square distribution with (5 + n) de- 
grees of freedom, where Tj n is the (i,i)th entry of ((I + Sc^ 1 ) ■ Fix a clique 
C = Cj and define T n = diag(wjj: i G C). For i, j G C, let oj*j = ujij / ^/T in Tj n 
and Ct c = i,j G C)). Then fij^ given X has a Wishart distribution 

with parameters 5 + n and scale matrix T n (Ik+i + nSc)T n 

We first note that maxjTj n = Op(n~ l ). To see this, observe that + 
n^c*) -1 < ti^Sq , so that 

max|r in | < — 1| S^^ 1 1| (2,2) = Op(n _1 ) 
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in view of Lemma A. 3. On the other hand, from Lemma A. 2, it follows that 
maxc 1 1 Sell (2,2) = Op(l), so with probability tending to one, Sc < Lie, and 
hence (I + nS)^ 1 > (1 + nL)~ l Ic simultaneously for all cliques, for some 
constant L > 0. Hence maxj = Op(n). Consequently, with probability 

^ /2 i /2 

tending to one, the maximum eigenvalue of T n (Ik+i + nS c )T n is 
bounded by a constant depending only on eo, simultaneously for all cliques. 
Hence applying Lemma A. 3 of [4], it follows that for all i, j, 

(8.12) P[\u>ij -E(uij\X)\ >t}< M 4 exp[-m 4 ((5 + n)t 2 ], \t\ < m' 4 , 

for some constants M^m^m!^ > depending on eo only. 

Now, as a G-Wishart prior gives rise to a fc-banded structure, as arguing 
in the bounding of (8.6), we have that, for some M5, m^,m'^ > 0, and all 

1*1 < m 5i 



(8.13) 



jllO-O 1 



| (00j00) > t\x} < M 5 pe W [-m 5 n(2k + l)~ 2 t 2 ]. 

The reduction in the number of terms in the rows from p to (2k + 1) arises 
due to the fact that the G-Wishart posterior preserves the banded structure 
of the precision matrix. Choosing t = ^4(n _1 logp) -1 / 2 , with A sufficiently 
large, we get 

(8.14) E [P{||O - n B || (0O|0o) > Akin- 1 logp)- 1 / 2 !*}] -»• 0. 

Therefore, using Theorem 4.3, 
E [P {lin-flollcocoo) >2e n |X}] 



< Pc 



|O b -O 



II (00,00) 



> e„|x| +E pjllO-O 1 



I (00,00) 



which converges to zero if e n = max{Ak(n 1 logp) 1 / 2 ,7(fc)}. 



□ 



Proof of Theorem 5.1. In our scenario, the Bayes estimator under 
the reference prior is given by the expression 

p—k 



n R = E(n\s) 

Therefore 
||O r -O m 



p—k 

£ 



-1\0 



[I - n- l ){S- s lf - [l - n-^iSSlf 

i=3 



(00,00) 



n 



< n 



p—k 

£< 

i=2 



n 



p—k 

'£ 

3=3 



(s; 



-1\Q 



-1\0 



+ n 



(00,00) 



(00,00) 



(00,00) 
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The rest of the proof proceeds as in Lemma 4.2. □ 

APPENDIX A: PROOFS OF AUXILIARY RESULTS 

In this section we give proofs of some lemmas we have used in the paper, 
which are of some general interest. 

The first lemma deals with the various equivalence conditions for matrix 
norms and is easily found in standard textbooks. 

Lemma A.l. For a symmetric matrix A of order k, we have the follow- 
ing: 

1- 11^11(2,2) < ll^ll(oo,oo) < V^||A||( 2) 2); 

2. \\A\\oo < ||A|| (2i2 ) < ||A|| (o0j0o) < fcHAIloo; 

Now we prove the lemma concerning the equivalence of the classes of 
matrices considered for the precision matrix ft. 

Proof of Lemma 3.1. We rewrite the class of matrices defined in (3.1) 

as 

U(eq, 7) = i^fl: max^{o;y: \i — j\ > k} < j(k) for all k > 0, 

maxjiirr 1 !!^), ||fi||( 2 , 2 )} < ^ 1 \ ■ 



(A.l) 



Now, max{||fi 1 1| (00,00) , II ^11 (00,00)} < K l implies max { \\fl 1 ||( 2 ,2), 11^11(2,2)} < 
£q 1 for K\ = £q 1 , using Lemma A.l. Thus V(-£Ci,7) C U(eq,j). 
To see the other way, note that, for any fixed ko, 

II^II(00,00) < 11^ _ ^feo( fi ) II (00,00) + 11^0(^)11(00,00) 

< 7 (M + (2fc + l)||n||oo 

< 7(fc ) + (2fc + l)||«||(2,2) 

(A.2) < j(k ) + (2k Q + 1)£q X . 

Choosing K 2 = j(k ) + (2k + l)e^ gives W(e ,7) C V{K 2:1 ). □ 

Lemma A. 2. Let Zi, i = 1, . . . , n, be i.i.d. k-dimensional random vectors 
distributed asN(0,D) and max {||-D || (00,00) > ||-D||(oo,oo)} — K. Then for 
the sample variance S n = Y27=l ^iZf , we have 

(A. 3) P \\S n -D\\,.>t < Mk 2 exp(-mnk~ 2 t 2 ), \t\ < m , 
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where M, m, m! > depend on K only. 

In particular, if k 2 (logk)/n — > 0, then ||«SVi||(oo,oo) = Op(l). 

Proof. The proof directly follows from Lemma A. 3 of [4] and noting 
from Lemma A.l that \\S n — D\\(po,oo) — k\\S n — -D||oo- D 

Lemma A. 3. Let Z{, i = 1, . . . , n, be i.i.d. k- dimensional random vectors 
distributed asN(0,D) and max {||-D ||(oo,oo)> ||-D||(oo,oo)} ^ K- Then for 
the sample variance S n = Yli=l ^iZf , we have 

(A.4) P [HS- 1 !!^) > M] < M'k 2 eM-mnk- 2 C' 2 ), 

where M > K and M', m > depend on M and K only. 

PROOF. Note that, 

II <j — 111 <- || T~) — 1|| i || C 1 " 1 J~)~ ^11 

W^n ll(oo,oo) — W-'S ll(oo,oo) ' II "ri J -' ll(oo,oo) 

11-^ ll(oo,oo) "I - 11-^ II (oo,oo) II -^H (oo,oo) II ||(oo,oo) 

(A.5) < K(l + \\S n -D\\ 

(oo,oo) II ll(oo.oo)) - 

This implies that 

119-111 < - 

" " " (0O ' 0O) - 1-ll-Sn- D\\ (OOJBO) K' 

Thus, using Lemma A. 2, we obtain 

K 



(A.6) 



pDIS- 1 !!^) >m] <p 



< p 



> M 



1 \\Sn D ||(oo,oo)-^" 
llfti" -011(00,00) >^" 1 "^ 1 



< M'k 2 exp(—m 2 nk z 1 



REFERENCES 



□ 



[1] Atay-Kayis, A. and Massam, H. (2005). A Monte-Carlo method for computing the 
marginal likelihood in nondecomposable Gaussian graphical models. Biometrika 92 
317-335. 

[2] Belitser, E. and Ghosal, S. (2003). Adaptive Bayesian inference on the mean of 
an infinite-dimensional normal distribution. Ann. Statist. 31 536-559. 

[3] Bickel, P. J. and Levina, E. (2008a). Covariance regularization by thresholding. 
Ann. Statist. 36 2577-2604. 



CONVERGENCE RATES FOR ESTIMATING LARGE PRECISION MATRICES 23 



Bickel, P. J. and Levina, E. (2008b). Regularized estimation of large covariancc 
matrices. Ann. Statist. 36 199-227. 

Cai, T. and Liu, W. (2011). Adaptive thresholding for sparse covariance matrix 
estimation. J. Amer. Statist. Assoc. 106 672-684. 

Cai, T., Liu, W. and Luo, X. (2011). A constrained ^-minimization approach to 
sparse precision matrix estimation. J. Amer. Statist. Assoc. 106 594-607. 
Cai, T. T. and Yuan, M. (2012). Adaptive covariance matrix estimation through 
block thresholding. Ann. Statist, (to appear). 

Cai, T. T., Zhang, C. H. and Zhou, H. H. (2010). Optimal rates of convergence 
for covariance matrix estimation. Ann. Statist. 38 2118-2144. 

Carvalho, C. M., Massam, H. and West, M. (2007). Simulation of hyper-inverse 
Wishart distributions in graphical models. Biometrika 94 647-659. 
Carvalho, C. M. and Scott, J. G. (2009). Objective Bayesian model selection in 
Gaussian graphical models. Biometrika 96 497-512. 

Dawid, A. P. and Lauritzen, S. L. (1993). Hyper Markov laws in the statistical 
analysis of decomposable graphical models. Ann. Statist. 21 1272-1317. 
DlACONlS, P. and Ylvisaker, D. (1979). Conjugate priors for exponential families. 
Ann. Statist. 7 269-281. 

Dobra, A., Lenkoski, A. and Rodriguez, A. (2011). Bayesian inference for general 
Gaussian graphical models with application to multivariate lattice data. J. Amer. 
Statist. Assoc. 106 1418-1433. 

Dobra, A., Hans, C, Jones, B., Nevins, J. R., Yao, G. and West, M. (2004). 
Sparse graphical models for exploring gene expression data. J. Multivariate Anal. 90 
196-212. 

Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance 
estimation with the graphical lasso. Biostatistics 9 432-441. 

Ghosal, S. (2000). Asymptotic normality of posterior distributions for exponential 
families when the number of parameters tends to infinity. J. Multivariate Anal. 74 
49-68. 

Ghosal, S., Lember, J. and van der Vaart, A. (2008). Nonparametric Bayesian 
model selection and averaging. Electron. J. Stat. 2 63-89. 

Huang, T. M. (2004). Convergence rates for posterior distributions and adaptive 
estimation. Ann. Statist. 32 1556-1593. 

Huang, J. Z., Liu, N., Pourahmadi, M. and Liu, L. (2006). Covariance matrix 
selection and estimation via penalised normal likelihood. Biometrika 93 85-98. 
Karoui, N. E. (2008). Operator norm consistent estimation of large-dimensional 
sparse covariance matrices. Ann. Statist. 36 2717-2756. 

Kruijer, W., Rousseau, J. and van der Vaart, A. (2010). Adaptive Bayesian 
density estimation with location-scale mixtures. Electron. J. Stat. 4 1225-1257. 
Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariancc 
matrix estimation. Ann Statist. 37 4254. 

Lauritzen, S. L. (1996). Graphical Models 17. Oxford University Press, USA. 
Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional 
covariance matrices. J. Multivariate Anal. 88 365-411. 

Lenkoski, A. and Dobra, A. (2011). Computational aspects related to inference in 
Gaussian graphical models with the G- Wishart prior. J. Comput. Graphical Statist. 
20 140-157. 

Letac, G. and MASSAM, H. (2007). Wishart distributions for decomposable graphs. 
Ann. Statist. 35 1278-1323. 

Meinshausen, N. and Buhlmann, P. (2006). High-dimensional graphs and variable 



24 BANERJEE AND GHOSAL 



selection with the lasso. Ann Statist. 34 1436-1462. 
[28] Muirhead, R. (2005). Aspects of Multivariate Statistical Theory. Wiley, New York. 
[29] Pati, D., Bhattacharya, A., Pillai, N. S. and Dunson, D. (2012). Posterior 

contraction in sparse Bayesian factor models for massive covariance matrices. 
[30] Rajaratnam, B., Massam, H. and Carvalho, C. M. (2008). Flexible covariance 

estimation in graphical Gaussian models. Ann Statist. 36 2818-2849. 
[31] Rothman, A. J., Levina, E. and Zhu, J. (2009). Generalized thresholding of large 

covariance matrices. J. Amer. Statist. Assoc. 104 177-186. 
[32] Rothman, A. J., Bickel, P. J., Levina, E. and Zhu, J. (2008). Sparse permutation 

invariant covariance estimation. Electron. J. Statist. 2 494-515. 
[33] ROUSSEAU, J. (2010). Rates of convergence for the posterior distributions of mixtures 

of Betas and adaptive nonparametric estimation of the density. Ann. Statist. 38 146- 

180. 

[34] Roverato, A. (2000). Cholesky decomposition of a hyper inverse Wishart matrix. 

Biometrika 87 99-112. 
[35] Shen, W. and Ghosal, S. (2012). Adaptive Bayesian procedures using random series 

prior. 

[36] Shen, W., Tokdar, S. T. and Ghosal, S. (2012). Adaptive Bayesian multivariate 

density estimation with Dirichlet mixtures. 
[37] van der Vaart, A. and van Zanten, H. (2009). Adaptive Bayesian estimation 

using a Gaussian random field with inverse Gamma bandwidth. Ann. Statist. 37 

2655-2675. 

[38] Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graph- 
ical model. Biometrika 94 19-35. 



Department of Statistics 

North Carolina State University 

5219 SAS Hall, 2311 Stinson Drive 

Raleigh, NC 27695-8203 

U.S.A. 

E-MAIL: sbanerj5@ncsu.edu 



Department of Statistics 

North Carolina State University 

4276 SAS Hall, 2311 Stinson Drive 

Raleigh, NC 27695-8203 

U.S.A. 

E-MAIL: sghosal@stat.ncsu.edu 



CONVERGENCE RATES FOR ESTIMATING LARGE PRECISION MATRICES 25 



Table 1 

Simulation results for different structures of precision matrices 



CI 


n 






iin — si\ 






£ - S 2 




MLE 


Bayes 


BL 


MLE 


Bayes 


BL 






25 


1.422 


1.486 


335.411 


1.068 


0.918 


1.137 




50 


50 


1.680 


1.736 


79.017 


1.299 


1.100 


1.302 






100 


1.947 


1.991 


432 036 


1.511 


1.214 


1.342 






250 


2.038 


2.071 


512 613 


1.671 


1.302 


1.434 






25 


1.799 


2.168 


8.266 


1.503 


1.215 


1.386 




50 


50 


2.335 


2.712 


33.876 


1.830 


1.351 


1.588 






100 


2.703 


3.089 


41.197 


1.966 


1.432 


1.666 






250 


2.868 


3.259 


1 86 583 


2.221 


1.509 


1.940 






25 


1.166 


1.267 


4.115 


1.776 


1.947 


2.047 




50 


50 


1.301 


1.399 


4.747 


2.970 


3.144 


3.233 






100 


1.517 


1.613 


8.806 


4.430 


4.618 


4.726 






250 


1.593 


1.685 


10.985 


7.239 


7.431 


7.539 






50 


1.119 


1.143 


124.327 


0.732 


0.710 


1.120 




100 


100 


1.149 


1.191 


578.593 


0.885 


0.840 


1.199 






200 


1.271 


1.311 


841 829 


1.045 


0.918 


1.227 






500 


1.402 


1.436 


1 636 Q8Q 


1.190 


1.016 


1.266 






50 


1.166 


1.343 


5.913 


1.217 


1.012 


1.113 




100 


100 


1.253 


1.442 


8.715 


1.366 


1.107 


1.240 






200 


1.386 


1.578 


32 952 


1.448 


1.162 


1.305 






500 


1.535 


1.734 


39 268 


1.560 


1.238 


1.423 






50 


0.937 


0.991 


3.228 


2.898 


2.997 


3.189 




100 


100 


0.976 


1.030 


3.607 


4.422 


4.523 


4.721 






200 


1.309 


1.362 


4.541 


6.464 


6.566 


6.772 






500 


1.152 


1.204 


4.603 


10.169 


10.271 


10.470 






100 


0.692 


0.706 


387.333 


0.657 


0.627 


1.096 


ft! 


200 


200 


0.791 


0.815 


689.974 


0.696 


0.647 


1.130 






400 


0.834 


0.857 


1653.393 


0.765 


0.711 


1.171 






1000 


0.916 


0.939 


7196.422 


0.799 


0.759 


1.200 






100 


0.738 


0.828 


5.564 


0.893 


0.826 


0.860 




200 


200 


0.801 


0.892 


6.529 


1.027 


0.942 


0.986 






400 


0.995 


1.090 


8.069 


1.095 


0.977 


1.031 






1000 


1.043 


1.139 


8.611 


1.124 


0.979 


1.041 






100 


0.621 


0.648 


2.083 


4.432 


4.484 


4.724 




200 


200 


0.706 


0.734 


2.300 


6.454 


6.507 


6.753 






400 


0.750 


0.777 


2.516 


9.134 


9.187 


9.434 






1000 


0.815 


0.843 


2.818 


14.029 


14.082 


14.330 



